This paper studies $\ell_1$ regularization with high-dimensional features for support vector machines
with a built-in reject option (meaning that the decision of classifying an observation can be 
withheld at a cost lower than that of misclassification). The procedure can be conveniently 
implemented as a linear program and computed using standard software. We prove that the 
minimizer of the penalized population risk favors sparse solutions and show that the behavior 
of the empirical risk minimizer mimics that of the population risk minimizer. We also introduce 
a notion of classification complexity and prove that our minimizers adapt to the unknown 
complexity. Using a novel oracle inequality for the excess risk, we identify situations where fast 
rates of convergence occur. 
1. Introduction

In this paper we further investigate the new classification rules introduced in [1, 11] with
a built-in reject option in the standard binary classification setting, where we observe
independent realizations $(X_i, Y_i)$, $i = 1, \ldots, n$, of a random pair $(X, Y)$ in
$\mathcal{X} \times \{-1, +1\}$ (here, $\mathcal{X}$ is an arbitrary space). A discriminant function
$f : \mathcal{X} \to \mathbb{R}$ classifies an observation $x \in \mathcal{X}$ into one of two classes,
labeled $-1$ or $+1$. Viewing $f(x)$ as a proxy value of the conditional probability
$\eta(x) = P\{Y = 1 \mid X = x\}$, we are less confident for small values of $|f(x)|$,
corresponding to $\eta(x)$ near $1/2$. Our strategy is to report $\mathrm{sgn}(f(x)) \in \{-1, 1\}$ if
$|f(x)|$ exceeds some prescribed threshold $\tau$ and to withhold the decision otherwise. Assuming
that the cost of making a wrong decision is 1 and that of withholding a decision is $d$, the
appropriate risk function is

$$R_\ell(f) = E[\ell(Yf(X))] = P\{Yf(X) < -\tau\} + d\,P\{|Yf(X)| \le \tau\}$$

with the discontinuous loss function

$$\ell(z) = \begin{cases} 1, & \text{if } z < -\tau, \\ d, & \text{if } |z| \le \tau, \\ 0, & \text{otherwise.} \end{cases}$$

Since we always reject if $d = 0$ and never reject if $d > 1/2$ (see [5]), we take $0 < d \le 1/2$ in
what follows without loss of generality. Although the minimizer of this risk is not unique,
all such minimizers correspond to the unique classification rule that assigns $-1$, $+1$ or
withholds the decision, depending on which of $1 - \eta$, $\eta$ or $d$ is smallest. The smallest risk
is $E[\min\{\eta(X), 1 - \eta(X), d\}]$ and we may interpret the cost $d$ as the largest conditional
probability of misclassification that is considered tolerable.
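
The loss and the thresholding rule above can be written out directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the convention of returning 0 for a withheld decision is an assumption made only for illustration.

```python
def reject_loss(z, tau, d):
    """Discontinuous loss: 1 if z < -tau, d if |z| <= tau, 0 otherwise."""
    if z < -tau:
        return 1.0
    if abs(z) <= tau:
        return d
    return 0.0


def decide(fx, tau):
    """Report sgn(f(x)) when |f(x)| exceeds tau; return 0 (assumed code) to withhold."""
    if abs(fx) <= tau:
        return 0
    return 1 if fx > 0 else -1


def empirical_reject_risk(margins, tau, d):
    """Average reject-option loss over the margins z_i = y_i * f(x_i)."""
    return sum(reject_loss(z, tau, d) for z in margins) / len(margins)
```

With $\tau = 0.5$ and $d = 0.25$, a margin of $0.3$ falls in the reject band and costs $d$, whereas a margin of $-1$ is a confident error and costs 1.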

In practice, minimization of the empirical counterpart
$\hat R_\ell(f) = (1/n) \sum_{i=1}^n \ell(Y_i f(X_i))$
of $R_\ell(f)$ over a large class of functions $f$ is computationally not feasible. For this reason,
we could replace the loss function $\ell$ by a convex surrogate loss function and consider
discriminant functions of the form $\mathsf{f}_\lambda(x) = \sum_{j=1}^M \lambda_j f_j(x)$ based on a set of known
functions $f_j : \mathcal{X} \to \mathbb{R}$ and coefficients $\lambda_j$, $1 \le j \le M$. Following [1], we will consider
the generalized hinge loss

$$\phi(z) = \begin{cases} 1 - az, & \text{if } z < 0, \\ 1 - z, & \text{if } 0 \le z < 1, \\ 0, & \text{otherwise} \end{cases}$$

with slope $a = (1 - d)/d \ge 1$. Observe that $\phi(z)$ is piecewise linear, so that minimization
of the empirical risk

$$\hat R_\phi(\mathsf{f}_\lambda) = \frac{1}{n} \sum_{i=1}^n \phi(Y_i \mathsf{f}_\lambda(X_i)) \tag{1.1}$$

can be solved by a tractable linear program. Crucial for the choice of $\phi(z)$ is that it is
classification calibrated: the unique minimizer

$$f_0(x) = \begin{cases} -1, & \text{if } \eta(x) < d, \\ 0, & \text{if } d \le \eta(x) \le 1 - d, \\ +1, & \text{if } \eta(x) > 1 - d \end{cases}$$

of $R_\phi(f) = E[\phi(Yf(X))]$ also minimizes the risk $R_\ell(f) = E[\ell(Yf(X))]$ over all measurable
$f : \mathcal{X} \to \mathbb{R}$ for all $\tau < 1$; see, for example, [1, 12].
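
As an illustration of the calibration property, the sketch below evaluates the conditional $\phi$-risk $\eta\,\phi(v) + (1 - \eta)\,\phi(-v)$ at a point with conditional probability $\eta$ and compares a grid minimizer with the rule $f_0$. The function names and the grid are assumptions chosen for this example only.

```python
def generalized_hinge(z, d):
    """Generalized hinge loss with slope a = (1 - d) / d on the negative side."""
    a = (1.0 - d) / d
    if z < 0:
        return 1.0 - a * z
    if z < 1:
        return 1.0 - z
    return 0.0


def conditional_phi_risk(eta, v, d):
    """Conditional phi-risk of predicting the value v when P(Y = 1 | x) = eta."""
    return eta * generalized_hinge(v, d) + (1.0 - eta) * generalized_hinge(-v, d)


def f0(eta, d):
    """Minimizer of the phi-risk: -1, 0 or +1 depending on where eta falls."""
    if eta < d:
        return -1
    if eta > 1.0 - d:
        return 1
    return 0
```

For $d = 0.25$ and $\eta = 0.1$, the conditional risk is $0.4$ at $v = -1$ and $1$ at $v = 0$, so the grid search selects $v = -1$, in agreement with $f_0$.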

At this point it is important to note that truncating the minimizer $\mathrm{sgn}(2\eta - 1)$ of the
hinge-loss-based risk $E(1 - Yf(X))_+$ does not yield the optimal rule for any positive
threshold $\tau$. This is the reason why we generalize the hinge loss instead. In addition
to the generalized hinge loss, there are also other choices of the surrogate loss function
and corresponding truncation value $\tau$ that are classification calibrated. The treatment
for the generalized hinge loss differs considerably from that for other losses, such as 
the logistic, exponential and quadratic loss, which are smoother. We refer to [12] for a 
detailed discussion. 
Observe that $\phi(z) \ge \ell(z)$ for all $\tau \le 1 - d$ and, subsequently,
$E[\ell(Yf(X))] \le E[\phi(Yf(X))]$. It is shown in [1] that a similar relationship remains true for the excess
risks, that is, the inequality

$$E[\ell(Yf(X))] - E[\ell(Yf_0(X))] \le E[\phi(Yf(X))] - E[\phi(Yf_0(X))]$$

holds for all $d \le \tau \le 1 - d$. This property is useful for deriving oracle inequalities in terms
of the $\ell$-risk since minimization of (1.1) produces oracle inequalities in terms of the $\phi$-risk
rather than the $\ell$-risk directly.
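
The pointwise domination $\phi(z) \ge \ell(z)$ can be checked numerically. The sketch below is only a sanity check on a finite grid (the grid and the function names are assumptions): at $z = \tau$ the surrogate equals $1 - \tau$, which is at least $d$ exactly when $\tau \le 1 - d$.

```python
def generalized_hinge(z, d):
    """Generalized hinge loss, written as a maximum of three affine pieces."""
    a = (1.0 - d) / d
    return max(0.0, 1.0 - z, 1.0 - a * z)


def reject_loss(z, tau, d):
    """Reject-option loss: 1 if z < -tau, d if |z| <= tau, 0 otherwise."""
    if z < -tau:
        return 1.0
    if abs(z) <= tau:
        return d
    return 0.0


def dominates(tau, d, grid):
    """True when the surrogate is at least the reject loss at every grid point."""
    return all(generalized_hinge(z, d) >= reject_loss(z, tau, d) for z in grid)
```

With $d = 0.25$, the check passes for $\tau = 0.75 = 1 - d$ and fails for $\tau = 0.9$, where $\phi(0.9) = 0.1 < d$.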

Of particular interest here is the case where the number of basis functions, M, is large 
when compared with the sample size $n$. Usually, the minimization of the empirical risk
$\hat R_\phi(\mathsf{f}_\lambda)$ is computed under a restriction on the quadratic term $\sum_{j=1}^M \lambda_j^2$. Here, we opt
instead for an $\ell_1$-type restriction $\|\lambda\|_{\ell_1} := \sum_{j=1}^M |\lambda_j|$ and estimate $f_0$ by
$\mathsf{f}_{\hat\lambda(r)}$, where

$$\hat\lambda(r) := \arg\min_{\lambda \in \mathbb{R}^M} \bigl( \hat R_\phi(\mathsf{f}_\lambda) + r\|\lambda\|_{\ell_1} \bigr) \tag{1.2}$$

and $r > 0$ is a tuning parameter. The choice of an $\ell_1$ penalty reflects our preference for
sparse solutions, which is desirable when $M$ is large.

In the remainder of this paper, we study the properties of $\hat\lambda(r)$ and its population
counterpart,

$$\lambda(r) := \arg\min_{\lambda \in \mathbb{R}^M} \bigl( R_\phi(\mathsf{f}_\lambda) + r\|\lambda\|_{\ell_1} \bigr). \tag{1.3}$$

We establish oracle inequalities for $\lambda(r)$ and $\hat\lambda(r)$ in Sections 2 and 3, respectively. The
results that we obtain are similar in spirit to those from [6, 8, 11]. However, [8, 11] do
not discuss properties of $\lambda(r)$, and our results in Section 2 extend those
proved by [6] in the context of twice differentiable loss functions. Furthermore, the oracle
inequalities for the penalized empirical risk minimizer $\hat\lambda(r)$ in Section 3 are much sharper
than earlier results from [11] for $0 < d < 1/2$ and [8] for $d = 1/2$. In particular, the new
inequality reveals that the rate of convergence of the excess risk of $\mathsf{f}_{\hat\lambda}$ can be even faster
than $1/n$ if the optimal discriminant function $f_0$ can be written as a linear combination
of the $f_j$'s in the dictionary. Moreover, we relax the condition on the dictionary and do
not require that the parameter $\lambda$ is bounded. We emphasize that our results hold, in
particular, for $d = 1/2$, the case of support vector machines without a reject option, and
generalize and extend the results obtained in [8]. In addition, novel empirical bounds on
the error and reject rate are given. To demonstrate the feasibility of the $\ell_1$-regularized
support vector machine with a reject option, in Section 4 we formulate $\hat\lambda(r)$ as a solution
of a linear program and report some numerical experiments. Some technical lemmas and
a maximal inequality for a weighted empirical process are collected in the Appendix.
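We begin by studying $\lambda(r)$, the population version of $\hat\lambda(r)$. Recall that $\lambda(r)$ is defined by

$$\lambda(r) = \arg\min_{\lambda \in \mathbb{R}^M} \{ R_\phi(\mathsf{f}_\lambda) + r\|\lambda\|_{\ell_1} \}. \tag{2.1}$$

In particular, $\lambda(0)$ minimizes the risk $R_\phi(\mathsf{f}_\lambda)$ over $\lambda \in \mathbb{R}^M$. By definition, we find that

$$R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\lambda(r)\|_{\ell_1} \le R_\phi(\mathsf{f}_\lambda) + r\|\lambda\|_{\ell_1} \tag{2.2}$$

holds for all $\lambda \in \mathbb{R}^M$. This inequality applied to $\lambda = \lambda(0)$ has the following consequences.

Proposition 2.1. Let $I_0 = \{i : \lambda_i(0) \ne 0\}$ be the support of $\lambda(0)$.

(a) If $\|\lambda(0)\|_{\ell_1} = o(1/r)$ as $r \to 0$, then $R_\phi(\mathsf{f}_{\lambda(r)}) \to R_\phi(\mathsf{f}_{\lambda(0)})$ as $r \to 0$.

(b) $\|\lambda(r)\|_{\ell_1} \le \|\lambda(0)\|_{\ell_1}$ for all $r > 0$.

(c) $\sum_{i \notin I_0} |\lambda_i(r) - \lambda_i(0)| \le \sum_{i \in I_0} |\lambda_i(r) - \lambda_i(0)|$.

Proof. After applying inequality (2.2) to $\lambda = \lambda(0)$ and using the fact that
$R_\phi(\mathsf{f}_{\lambda(0)}) \le R_\phi(\mathsf{f}_{\lambda(r)})$, we get

$$0 \le R_\phi(\mathsf{f}_{\lambda(r)}) - R_\phi(\mathsf{f}_{\lambda(0)}) \le r\|\lambda(0)\|_{\ell_1} - r\|\lambda(r)\|_{\ell_1} \le r\|\lambda(0)\|_{\ell_1},$$

which implies (a). The second claim follows from

$$R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\lambda(r)\|_{\ell_1} \le R_\phi(\mathsf{f}_{\lambda(0)}) + r\|\lambda(0)\|_{\ell_1} \le R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\lambda(0)\|_{\ell_1}.$$

For the proof of part (c), we first observe that $\|\lambda(r)\|_{\ell_1} \le \|\lambda(0)\|_{\ell_1}$ is equivalent to

$$\sum_{j \notin I_0} |\lambda_j(r)| \le \sum_{j \in I_0} |\lambda_j(0)| - \sum_{j \in I_0} |\lambda_j(r)|.$$

Next, we note that the term on the left equals $\sum_{j \notin I_0} |\lambda_j(0) - \lambda_j(r)|$ and we bound the
term on the right by $\sum_{j \in I_0} |\lambda_j(0) - \lambda_j(r)|$ using the triangle inequality. This proves part
(c). □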

This result gives a simple condition for $R_\phi(\mathsf{f}_{\lambda(r)}) \to R_\phi(\mathsf{f}_{\lambda(0)})$ and shows that the $\ell_1$
norm of the solution $\lambda(r)$ never exceeds the $\ell_1$ norm of $\lambda(0)$. Similar properties
are established by [6] for minimizers of twice differentiable loss functions $\phi$ and $\ell_p$ norms
for $p > 1$. In contrast, we consider here a non-differentiable loss function $\phi$ and $p = 1$.

Our target is a sparse vector $\theta \in \mathbb{R}^M$ with risk $R_\phi(\mathsf{f}_\theta)$ close to $R_\phi(\mathsf{f}_{\lambda(0)})$. Before we
make this precise, we need to introduce a few concepts depending on the behavior of
$\eta(X)$ near $d$ and $1 - d$, and the set of functions $f_j$.



1372 M. Wegkamp and M. Yuan 

Definition 2.2 (Classification complexity). The classification complexity is defined
as the largest number $\alpha \ge 0$ such that, for some $A \ge 1$ and all $t > 0$,

$$P\{|\eta(X) - d| \le t\} \le At^\alpha \quad\text{and}\quad P\{|\eta(X) - (1 - d)| \le t\} \le At^\alpha.$$

This notion of complexity is a generalization of Tsybakov's margin condition [9] for
$d = 1/2$. The behavior of $\eta(X)$ is obviously not relevant in the interval $(d, 1 - d)$, only at
the endpoints $d$ and $1 - d$. The inequality always holds for $\alpha = 0$ and $A = 1$. In contrast,
$\alpha = +\infty$ describes the easiest classification situation where we essentially require that
$\eta(X)$ stays away from $d$ and $1 - d$ with probability one. If $\eta(X)$ has a density in the
neighborhood of $d$ and $1 - d$, then we have that $\alpha = 1$.
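
To make the density case concrete, consider the purely illustrative assumption that $\eta(X)$ is uniformly distributed on $[0, 1]$. Then $P\{|\eta(X) - d| \le t\}$ is the length of $[d - t, d + t] \cap [0, 1]$, which is at most $2t$, so the condition holds with $\alpha = 1$ and $A = 2$ but fails for $\alpha = 2$ with the same $A$. The helper names below are hypothetical.

```python
def prob_near(point, t):
    """P{|eta(X) - point| <= t} when eta(X) is Uniform(0, 1) (an assumed model)."""
    lo, hi = max(0.0, point - t), min(1.0, point + t)
    return max(0.0, hi - lo)


def satisfies_margin(d, alpha, A, ts, tol=1e-12):
    """Check both inequalities of the complexity condition on a finite set of t values."""
    return all(
        prob_near(d, t) <= A * t ** alpha + tol
        and prob_near(1.0 - d, t) <= A * t ** alpha + tol
        for t in ts
    )
```

A finite grid of $t$ values can only refute the condition, not prove it; here the bound $2t$ follows from the interval length directly.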

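Definition 2.3 (Restricted eigenvalue condition). Let $\theta \in \mathbb{R}^M$, $c \ge 1$ and $\Psi$ be the
$M \times M$ matrix with entries $\Psi_{i,j} = 4E[f_i(X) f_j(X) \omega(X)]$ with $\omega(X) = \eta(X)\{1 - \eta(X)\}$.
For $I = \{i : \theta_i \ne 0\}$, the support of $\theta$, we define

$$\kappa^2(\theta, c) = \inf_{\lambda \in \mathbb{R}^M :\ \|(\theta - \lambda)_{I^c}\|_{\ell_1} \le c\|(\theta - \lambda)_I\|_{\ell_1}} \frac{(\theta - \lambda)' \Psi (\theta - \lambda)}{4\|(\theta - \lambda)_I\|_{\ell_2}^2}.$$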

The condition $\kappa(\theta, c) > 0$ is a restricted eigenvalue condition on the Gram matrix $\Psi$
of the type introduced in [2] in the context of linear regression. Using similar reasoning
as in [2], page 1714, it is implied by the local mutual coherence condition used in [11].
We are now in position to state an oracle inequality for the excess risk,

$$\Delta R_\phi(\mathsf{f}_{\lambda(r)}) := R_\phi(\mathsf{f}_{\lambda(r)}) - R_\phi(f_0), \tag{2.3}$$

of the regularized minimizer $\lambda(r)$ and the $\ell_1$-distance between the vectors $\lambda(r)$ and $\theta$.

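Theorem 2.4. Let $\alpha$ be the classification complexity, and $\theta$ be such that
$R_\phi(\mathsf{f}_\theta) \le R_\phi(\mathsf{f}_{\lambda(r)})$ and $\kappa = \kappa(\theta, 1) > 0$. Then, for any

$$r \le (2C_F)^{-(2+\alpha)/\alpha} \{4A(2d)^\alpha\}^{-1/\alpha} (\kappa^{-2}\|\theta\|_{\ell_0})^{-(1+\alpha)/\alpha} \tag{2.4}$$

with $C_F = \max_j \|f_j\|_\infty = \max_j \sup_x |f_j(x)|$ and $\|\theta\|_{\ell_0} = \sum_{j=1}^M 1\{\theta_j \ne 0\}$, we have

$$\Delta R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\lambda(r) - \theta\|_{\ell_1} \le 3\Delta R_\phi(\mathsf{f}_\theta) + 6\{4A(2d)^\alpha\}^{1/(2+\alpha)} \|\mathsf{f}_\theta - f_0\|_\infty (\kappa^{-2} r^2 \|\theta\|_{\ell_0})^{(1+\alpha)/(2+\alpha)}. \tag{2.5}$$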

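Proof. Set $\delta = \lambda(r) - \theta$. Let $I = \{i : \theta_i \ne 0\}$ be the support of $\theta$. It is straightforward to
derive from Proposition 2.1 that

$$R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\delta\|_{\ell_1} \le R_\phi(\mathsf{f}_\theta) + 2r\|\delta_I\|_{\ell_1}$$

and, subsequently, that

$$r\|\delta_{I^c}\|_{\ell_1} \le R_\phi(\mathsf{f}_\theta) - R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\delta_I\|_{\ell_1} \le r\|\delta_I\|_{\ell_1}.$$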




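The first inequality, combined with the assumption $\kappa = \kappa(\theta, 1) > 0$, yields

$$\Delta R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\delta\|_{\ell_1} \le \Delta R_\phi(\mathsf{f}_\theta) + \kappa^{-1}\|\mathsf{f}_{\lambda(r)} - \mathsf{f}_\theta\| (r^2|I|)^{1/2}$$
$$\le \Delta R_\phi(\mathsf{f}_\theta) + \kappa^{-1}\|\mathsf{f}_{\lambda(r)} - f_0\| (r^2|I|)^{1/2} + \kappa^{-1}\|\mathsf{f}_\theta - f_0\| (r^2|I|)^{1/2},$$

using the notation $\|f\| = E^{1/2}[f^2(X)\omega(X)]$ and $\omega(X) = \eta(X)(1 - \eta(X))$. By Lemma A.1
in Appendix A, we find that

$$\|\mathsf{f}_\lambda - f_0\|^{2+2\alpha} \le 4A(2d)^\alpha \|\mathsf{f}_\lambda - f_0\|_\infty^{2+\alpha} \{\Delta R_\phi(\mathsf{f}_\lambda)\}^\alpha$$

for $\lambda = \theta$ and $\lambda = \lambda(r)$. After we plug this bound into the right-hand side of the previous
display, we find that

$$\Delta R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\delta\|_{\ell_1} \le \Delta R_\phi(\mathsf{f}_\theta) + \kappa^{-1}(r^2|I|)^{1/2} \{4A(2d)^\alpha\}^{1/(2+2\alpha)} \|\mathsf{f}_{\lambda(r)} - f_0\|_\infty^{(2+\alpha)/(2+2\alpha)} \{\Delta R_\phi(\mathsf{f}_{\lambda(r)})\}^{\alpha/(2+2\alpha)}$$
$$+ \kappa^{-1}(r^2|I|)^{1/2} \{4A(2d)^\alpha\}^{1/(2+2\alpha)} \|\mathsf{f}_\theta - f_0\|_\infty^{(2+\alpha)/(2+2\alpha)} \{\Delta R_\phi(\mathsf{f}_\theta)\}^{\alpha/(2+2\alpha)}.$$

Next, we apply Young's algebraic inequality,

$$ab \le \frac{a^p}{p} + \frac{b^q}{q} \quad\text{with } p > 1 \text{ and } q = \frac{p}{p - 1} \text{ for all } a, b \ge 0,$$

to the last two terms on the right-hand side, with $p = (2 + 2\alpha)/\alpha$ and $q = (2 + 2\alpha)/(2 + \alpha)$,
to get

$$\Delta R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\delta\|_{\ell_1} \le \Delta R_\phi(\mathsf{f}_\theta) + \frac{\alpha}{2 + 2\alpha} \{\Delta R_\phi(\mathsf{f}_{\lambda(r)}) + \Delta R_\phi(\mathsf{f}_\theta)\}$$
$$+ \frac{2 + \alpha}{2 + 2\alpha} \{4A(2d)^\alpha\}^{1/(2+\alpha)} (\kappa^{-2} r^2 |I|)^{(1+\alpha)/(2+\alpha)} (\|\mathsf{f}_{\lambda(r)} - f_0\|_\infty + \|\mathsf{f}_\theta - f_0\|_\infty).$$

Since $\|\mathsf{f}_{\lambda(r)} - f_0\|_\infty \le \|\mathsf{f}_\theta - f_0\|_\infty + C_F\|\delta\|_{\ell_1}$, we deduce, after invoking (2.4), that

$$(2 + \alpha)\Delta R_\phi(\mathsf{f}_{\lambda(r)}) + (1 + 3\alpha/2) r\|\delta\|_{\ell_1} \le (2 + 3\alpha)\Delta R_\phi(\mathsf{f}_\theta) + 2(2 + \alpha)\{4A(2d)^\alpha\}^{1/(2+\alpha)} (\kappa^{-2} r^2 |I|)^{(1+\alpha)/(2+\alpha)} \|\mathsf{f}_\theta - f_0\|_\infty,$$

and the conclusion follows. □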

It is interesting to see that the bound (2.5) crucially depends on the classification
complexity parameter $\alpha$ and $\|\mathsf{f}_\theta - f_0\|_\infty$. In particular, if $f_0$ can itself be represented
as a linear combination of the basis functions, then $f_0 = \mathsf{f}_{\lambda(0)}$. In this case, provided
that $\kappa(\lambda(0), 1) > 0$, Theorem 2.4 implies that
$\Delta R_\phi(\mathsf{f}_{\lambda(r)}) + r\|\lambda(r) - \lambda(0)\|_{\ell_1} \le 0$. In other
words, we have the following corollary.

Corollary 2.5. If $f_0 = \mathsf{f}_{\lambda(0)}$ and $\kappa(\lambda(0), 1) > 0$, then $\lambda(r) = \lambda(0)$ for any

$$r \le (2C_F)^{-(2+\alpha)/\alpha} \{4A(2d)^\alpha\}^{-1/\alpha} (\kappa^{-2}\|\lambda(0)\|_{\ell_0})^{-(1+\alpha)/\alpha}.$$

3. $\ell_1$-regularized empirical generalized hinge risk minimizers

In this section we study the estimate $\hat\lambda(2r)$. In what follows, we will simplify notation so
as not to show dependence of $\hat\lambda$ on $r$ whenever no confusion occurs. Again, we emphasize
that our results hold, in particular, for $d = 1/2$, the case of a support vector machine
without a reject option.
Note that the inequality

$$\hat R_\phi(\mathsf{f}_{\hat\lambda}) + 2r\|\hat\lambda\|_{\ell_1} \le \hat R_\phi(\mathsf{f}_\lambda) + 2r\|\lambda\|_{\ell_1} \tag{3.1}$$

applied to the vector of zeros $\lambda = (0, \ldots, 0)'$ implies that $\|\hat\lambda\|_{\ell_1} \le \phi(0)/(2r) = 1/(2r)$. This
means that we can restrict our analysis to the set

$$\Lambda = \{\lambda \in \mathbb{R}^M : \|\lambda\|_{\ell_1} \le 1/(2r)\}.$$

The aim of this section is to show that $\hat\lambda$ is close to $\lambda(r)$ for a judiciously chosen tuning
parameter $r$.

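Theorem 3.1. If, for some $p \ge 1$,

$$r \ge \frac{C_F}{d} \left( 9\sqrt{\frac{2\log\{2(M \vee n)\}}{n}} + \frac{2p\log_2 n}{\sqrt{2M \vee 2n}} + \sqrt{\frac{2\log(1/\delta)}{n}} \right), \tag{3.2}$$

then for all $\theta \in \Lambda$, with probability larger than $1 - \delta$,

$$\Delta R_\phi(\mathsf{f}_{\hat\lambda}) + r\|\hat\lambda\|_{\ell_1} \le \Delta R_\phi(\mathsf{f}_\theta) + 3r\|\theta\|_{\ell_1} + n^{-p}$$

and, moreover,

$$\Delta R_\phi(\mathsf{f}_{\hat\lambda}) + r\|\hat\lambda - \theta\|_{\ell_1} \le \Delta R_\phi(\mathsf{f}_\theta) + 4r\|\theta\|_{\ell_1} + n^{-p}.$$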

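Proof. Write $\hat\delta = \hat\lambda - \theta$. Let $\varepsilon = r^{-1} n^{-p}$ and define

$$\hat r = \sup_{\lambda \in \Lambda} \frac{|\{\hat R_\phi(\mathsf{f}_\lambda) - R_\phi(\mathsf{f}_\lambda)\} - \{\hat R_\phi(\mathsf{f}_\theta) - R_\phi(\mathsf{f}_\theta)\}|}{\|\lambda - \theta\|_{\ell_1} + \varepsilon}. \tag{3.3}$$

By Propositions B.1 and B.2 in Appendix B,

$$P\{\hat r \le r\} \ge 1 - \delta$$

for the choice $r$ given in (3.2). Rewriting the inequality (3.1), we find that

$$R_\phi(\mathsf{f}_{\hat\lambda}) \le R_\phi(\mathsf{f}_\theta) + \{\hat R_\phi(\mathsf{f}_\theta) - R_\phi(\mathsf{f}_\theta)\} - \{\hat R_\phi(\mathsf{f}_{\hat\lambda}) - R_\phi(\mathsf{f}_{\hat\lambda})\} + 2r\|\theta\|_{\ell_1} - 2r\|\hat\lambda\|_{\ell_1}$$
$$\le R_\phi(\mathsf{f}_\theta) + \hat r(\|\hat\delta\|_{\ell_1} + \varepsilon) + 2r\|\theta\|_{\ell_1} - 2r\|\hat\lambda\|_{\ell_1}. \tag{3.4}$$

Thus, on the event $\hat r \le r$, after bounding $\|\hat\delta\|_{\ell_1} \le \|\hat\lambda\|_{\ell_1} + \|\theta\|_{\ell_1}$ and adding
$r\|\hat\lambda\|_{\ell_1}$ to both sides, we obtain

$$R_\phi(\mathsf{f}_{\hat\lambda}) + r\|\hat\lambda\|_{\ell_1} \le R_\phi(\mathsf{f}_\theta) + 3r\|\theta\|_{\ell_1} + r\varepsilon,$$

which proves the first claim. Adding $r\|\theta\|_{\ell_1}$ to both sides easily yields the second claim. □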
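
To get a feel for the size of the tuning parameter required by the theory, the sketch below evaluates a three-term lower bound of the form $(C_F/d)\{9\sqrt{2\log(2(M \vee n))/n} + 2p\log_2(n)/\sqrt{2M \vee 2n} + \sqrt{2\log(1/\delta)/n}\}$, which is how condition (3.2) is read here. The function name, the default values of $p$ and $\delta$, and this reading of the condition are assumptions.

```python
import math


def tuning_lower_bound(n, M, d, C_F, p=1.0, delta=0.05):
    """Assumed three-term form of the lower bound on r in Theorem 3.1."""
    big = max(M, n)
    # Expectation term from the maximal inequality.
    term_expectation = 9.0 * math.sqrt(2.0 * math.log(2 * big) / n)
    # Peeling term, using J = p * log2(n).
    term_peeling = 2.0 * p * math.log2(n) / math.sqrt(2 * big)
    # Deviation term from the bounded-differences inequality.
    term_deviation = math.sqrt(2.0 * math.log(1.0 / delta) / n)
    return (C_F / d) * (term_expectation + term_peeling + term_deviation)
```

For moderate sample sizes the bound is conservative, which is consistent with the numerical experiments in Section 4, where $r$ is selected by cross-validation rather than from the theoretical condition.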

A direct consequence of Theorem 3.1 is the following corollary which states that in
the sparse setting where $r\|\lambda(r)\|_{\ell_1} \to 0$, the estimator $\hat\lambda(2r)$ behaves like the penalized
minimizer $\lambda(r)$ in terms of their risk.

Corollary 3.2. Suppose that $r\|\lambda(r)\|_{\ell_1} \to 0$ as $n \to \infty$ for $r$ satisfying (3.2). Then, with
probability at least $1 - \delta$,

$$|\{R_\phi(\hat\lambda) + r\|\hat\lambda\|_{\ell_1}\} - \{R_\phi(\lambda(r)) + r\|\lambda(r)\|_{\ell_1}\}| \to 0$$

as $n \to \infty$. In particular, when taking $\theta = \lambda(0)$, we have $|R_\phi(\hat\lambda) - R_\phi(\lambda(0))| \to 0$ and
$\|\hat\lambda(2r) - \lambda(0)\|_{\ell_1} = o(1/r)$.

Proof. We combine the basic property (3.1) applied to $\theta = \lambda(r)$ and Theorem 3.1, and
we find that on the event $\hat r \le r$,

$$R_\phi(\lambda(r)) + r\|\lambda(r)\|_{\ell_1} \le R_\phi(\hat\lambda) + r\|\hat\lambda\|_{\ell_1} \le R_\phi(\lambda(r)) + r\|\lambda(r)\|_{\ell_1} + \{2r\|\lambda(r)\|_{\ell_1} + r\varepsilon\}.$$

The result then follows from $\{2r\|\lambda(r)\|_{\ell_1} + r\varepsilon\} \to 0$. □

We emphasize that the above results do not impose any restrictions on the dictionary
$\{f_j\}$. If we are willing to make assumptions on the Gram matrix $\Psi$, then we obtain a
more refined result.

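Theorem 3.3. For all $r$ satisfying (3.2) and $\theta \in \Lambda$ such that $\kappa = \kappa(\theta, 7) > 0$ and

$$(\kappa^{-2} r^{\alpha/(1+\alpha)} \|\theta\|_{\ell_0})^{(1+\alpha)/(2+\alpha)} \le c \tag{3.5}$$

for some (small) $c$ depending on $C_F$, $\alpha$, $A$ and $d$, we have, for some $C$ depending on $c$,
that

$$\Delta R_\phi(\mathsf{f}_{\hat\lambda}) + \tfrac{1}{2} r\|\hat\lambda - \theta\|_{\ell_1} \le 3\Delta R_\phi(\mathsf{f}_\theta) + C\|\mathsf{f}_\theta - f_0\|_\infty (\kappa^{-2} r^2 \|\theta\|_{\ell_0})^{(1+\alpha)/(2+\alpha)} + [\ldots]$$

holds with probability at least $1 - \delta$.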
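Proof. Recall that $\varepsilon = r^{-1} n^{-p}$ and write $\hat\delta = \hat\lambda - \theta$. We may assume without loss of generality that

$$R_\phi(\mathsf{f}_\theta) + \varepsilon r \le R_\phi(\mathsf{f}_{\hat\lambda}) + \tfrac{1}{2} r\|\hat\delta\|_{\ell_1} \tag{3.6}$$

holds, since otherwise the statement holds trivially. Consequently, on the event $\hat r \le r$,
using (3.4) and (3.6), we get

$$R_\phi(\mathsf{f}_{\hat\lambda}) \le R_\phi(\mathsf{f}_\theta) + \varepsilon r + r\|\hat\delta\|_{\ell_1} + 2r\|\theta\|_{\ell_1} - 2r\|\hat\lambda\|_{\ell_1}$$
$$\le R_\phi(\mathsf{f}_{\hat\lambda}) + \tfrac{3}{2} r\|\hat\delta\|_{\ell_1} + 2r\|\theta\|_{\ell_1} - 2r\|\hat\lambda\|_{\ell_1}$$
$$= R_\phi(\mathsf{f}_{\hat\lambda}) + \tfrac{3}{2} r\|\hat\delta_I\|_{\ell_1} + \tfrac{3}{2} r\|\hat\lambda_{I^c}\|_{\ell_1} + 2r\|\theta\|_{\ell_1} - 2r\|\hat\lambda\|_{\ell_1}$$
$$= R_\phi(\mathsf{f}_{\hat\lambda}) + \tfrac{3}{2} r\|\hat\delta_I\|_{\ell_1} + 2r\|\theta_I\|_{\ell_1} - 2r\|\hat\lambda_I\|_{\ell_1} - \tfrac{1}{2} r\|\hat\lambda_{I^c}\|_{\ell_1},$$

so that $\|\hat\delta_{I^c}\|_{\ell_1} \le 7\|\hat\delta_I\|_{\ell_1}$, where $I$ is the support of $\theta$. On the other hand,

$$R_\phi(\mathsf{f}_{\hat\lambda}) + \tfrac{1}{2} r\|\hat\delta\|_{\ell_1} \le R_\phi(\mathsf{f}_\theta) + \varepsilon r + \tfrac{3}{2} r\|\hat\delta\|_{\ell_1} + 2r\|\theta\|_{\ell_1} - 2r\|\hat\lambda\|_{\ell_1}$$
$$\le R_\phi(\mathsf{f}_\theta) + \varepsilon r + \tfrac{3}{2} r\|\hat\delta_I\|_{\ell_1} + 2r\|\theta_I\|_{\ell_1} - 2r\|\hat\lambda_I\|_{\ell_1} - \tfrac{1}{2} r\|\hat\lambda_{I^c}\|_{\ell_1}$$
$$\le R_\phi(\mathsf{f}_\theta) + \tfrac{7}{2} r\|\hat\delta_I\|_{\ell_1} + r\varepsilon.$$

The remainder of the proof follows that of Theorem 2.4, with $\kappa = \kappa(\theta, 7)$. □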



This result differs from [11] (and [8] for the case $d = 1/2$) in the appearance of the norm
$\|f_0 - \mathsf{f}_\theta\|_\infty$ on the right-hand side of the (oracle) inequality. This implies that for $f_0 = \mathsf{f}_\theta$
and for some sparse $\theta = \lambda(0)$ satisfying the conditions of Theorem 3.3, we can expect
fast rates, regardless of the classification complexity! Another important difference with
both papers is that no restriction is imposed on the sup-norm of $\mathsf{f}_\lambda$. Such a condition is
unnatural as $|\mathsf{f}_\lambda| \le C$ may overrule the restriction that the penalty term $r\|\lambda\|_{\ell_1}$ imposes.

We now consider bounds on the error and reject rates without an additional test 
sample. We write 

n 
i=l 

for any $\beta > 0$. The misclassification and rejection rates can be bounded above as follows.
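Theorem 3.4. If

$$r(\gamma) \ge \frac{9C_F}{\gamma}\sqrt{\frac{2\log\{2(M \vee n)\}}{n}} + \frac{2p\log_2(n)\,C_F}{\gamma\sqrt{2M \vee 2n}} + \frac{C_F}{\gamma}\sqrt{\frac{2\log(1/\delta)}{n}},$$

then, with probability at least $1 - \delta$, we have

$$P\{Y\mathsf{f}_{\hat\lambda}(X) < -\tau\} \le \min_{\gamma > 0} \bigl[ P_n\{Y\mathsf{f}_{\hat\lambda}(X) < -\tau + \gamma\} + r(\gamma)\|\hat\lambda\|_{\ell_1} \bigr] + n^{-p},$$

$$P\{|\mathsf{f}_{\hat\lambda}(X)| \le \tau\} \le \min_{\gamma > 0} \bigl[ P_n\{|\mathsf{f}_{\hat\lambda}(X)| \le \tau + \gamma\} + r(\gamma)\|\hat\lambda\|_{\ell_1} \bigr] + n^{-p}.$$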

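Proof. Set

$$\varphi_\gamma(z) = \begin{cases} 1, & \text{if } z < -\tau, \\ \frac{1}{\gamma}(\gamma - \tau - z), & \text{if } -\tau \le z < -\tau + \gamma, \\ 0, & \text{if } z \ge -\tau + \gamma. \end{cases}$$

The following inequalities then hold uniformly in $\lambda$:

$$P\{Y\mathsf{f}_\lambda(X) < -\tau\} \le P_n\{Y\mathsf{f}_\lambda(X) < -\tau + \gamma\} + R_{\varphi_\gamma}(\mathsf{f}_\lambda) - \hat R_{\varphi_\gamma}(\mathsf{f}_\lambda)$$
$$\le P_n\{Y\mathsf{f}_\lambda(X) < -\tau + \gamma\} + \hat r_0 \{\|\lambda\|_{\ell_1} + \varepsilon\},$$

where

$$\hat r_0 = \sup_{\lambda \in \Lambda} \frac{|\hat R_{\varphi_\gamma}(\mathsf{f}_\lambda) - R_{\varphi_\gamma}(\mathsf{f}_\lambda)|}{\|\lambda\|_{\ell_1} + \varepsilon}$$

with $\varepsilon$ given by $\varepsilon r(\gamma) = n^{-p}$. We can invoke Propositions B.1 and B.2 to complete the
proof of the first claim. The proof of the second claim uses the reasoning above, with the
only modification being that $\varphi_\gamma(z)$ is now given by

$$\varphi_\gamma(z) = \begin{cases} 1, & \text{if } |z| \le \tau, \\ \frac{1}{\gamma}(z + \gamma + \tau), & \text{if } -\tau - \gamma \le z < -\tau, \\ -\frac{1}{\gamma}(z - \gamma - \tau), & \text{if } \tau < z \le \tau + \gamma, \\ 0, & \text{if } |z| > \tau + \gamma; \end{cases}$$

the rest of the reasoning is unchanged. □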
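
The two piecewise-linear functions used in the proof sit between the indicator of the event of interest and the indicator of the same event enlarged by $\gamma$. The sketch below writes both as clipped linear ramps, which coincides with the piecewise definitions; the function names and the helper `empirical_rate` are hypothetical.

```python
def ramp_error(z, tau, gamma):
    """Equals 1 for z < -tau, 0 for z >= -tau + gamma, and is linear in between."""
    return min(1.0, max(0.0, (-tau + gamma - z) / gamma))


def ramp_reject(z, tau, gamma):
    """Equals 1 for |z| <= tau, 0 for |z| > tau + gamma, and is linear in between."""
    return min(1.0, max(0.0, (tau + gamma - abs(z)) / gamma))


def empirical_rate(values, event):
    """Fraction of sample values for which event(value) is true."""
    return sum(1 for v in values if event(v)) / len(values)
```

Both ramps are Lipschitz with constant $1/\gamma$, which is where the factor $1/\gamma$ in $r(\gamma)$ comes from. For instance, `empirical_rate(margins, lambda m: m < -tau + gamma)` gives the empirical term in the first bound when `margins` holds the values $Y_i\mathsf{f}_{\hat\lambda}(X_i)$.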

4. Numerical experiments 

We now demonstrate the practical merits of $\hat\lambda(r)$ via a couple of numerical experiments.
We begin by noting that the computation of $\hat\lambda(r)$ can be conveniently formulated as a
linear program. Let $\xi_1, \ldots, \xi_n$ be the slack variables such that

$$\xi_i \ge 0, \quad \xi_i \ge 1 - Y_i\mathsf{f}_\lambda(X_i), \quad \xi_i \ge 1 - aY_i\mathsf{f}_\lambda(X_i). \tag{4.1}$$

Clearly the minimum $\xi_i$ that satisfies these constraints is $\phi(Y_i\mathsf{f}_\lambda(X_i))$. We also introduce
slack variables $\xi_{n+i}$, $i = 1, \ldots, M$, to represent $|\lambda_i|$, that is,

$$\xi_{n+i} \ge \lambda_i, \quad \xi_{n+i} \ge -\lambda_i. \tag{4.2}$$

Using the slack variables, $\hat\lambda(r)$ can be given as the solution of the linear program

$$\min\, [\xi_1 + \cdots + \xi_n + nr(\xi_{n+1} + \cdots + \xi_{n+M})]$$

subject to

$$\xi_i \ge 0, \quad \xi_i \ge 1 - Y_i h_i, \quad \xi_i \ge 1 - aY_i h_i, \quad i = 1, \ldots, n,$$
$$\xi_{n+i} \ge \lambda_i, \quad \xi_{n+i} \ge -\lambda_i, \quad i = 1, \ldots, M,$$
$$h_i = \sum_{j=1}^M \lambda_j f_j(X_i), \quad i = 1, \ldots, n.$$
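
The linear program above can be assembled in a few lines. The sketch below is one possible encoding, not the authors' implementation: the variable ordering $(\lambda, \xi_1, \ldots, \xi_n, u_1, \ldots, u_M)$ with $u_j$ standing for $\xi_{n+j}$, the function names, and the use of SciPy's `linprog` with the HiGHS backend are all assumptions. The objective is written as $(1/n)\sum_i \xi_i + r\sum_j u_j$, that is, the penalized empirical risk (1.2); rescaling the objective by a positive constant does not change the minimizer.

```python
def generalized_hinge(z, d):
    """Generalized hinge loss as the maximum of its three affine pieces."""
    a = (1.0 - d) / d
    return max(0.0, 1.0 - z, 1.0 - a * z)


def build_lp(F, y, r, d):
    """Return (c, A_ub, b_ub, bounds) for the variables x = (lambda, xi, u).

    F[i][j] = f_j(X_i) and y[i] is in {-1, +1}; constraints are A_ub @ x <= b_ub.
    """
    n, M = len(F), len(F[0])
    a = (1.0 - d) / d
    A_ub, b_ub = [], []
    for slope in (1.0, a):  # xi_i >= 1 - slope * Y_i * h_i
        for i in range(n):
            row = [-slope * y[i] * F[i][j] for j in range(M)]
            row += [-1.0 if k == i else 0.0 for k in range(n)]
            row += [0.0] * M
            A_ub.append(row)
            b_ub.append(-1.0)
    for sign in (1.0, -1.0):  # u_j >= sign * lambda_j
        for j in range(M):
            row = [sign if k == j else 0.0 for k in range(M)]
            row += [0.0] * n
            row += [-1.0 if k == j else 0.0 for k in range(M)]
            A_ub.append(row)
            b_ub.append(0.0)
    c = [0.0] * M + [1.0 / n] * n + [r] * M
    bounds = [(None, None)] * M + [(0.0, None)] * (n + M)
    return c, A_ub, b_ub, bounds


def fit_reject_svm(F, y, r, d):
    """Solve the LP and return the estimated coefficient vector."""
    from scipy.optimize import linprog  # SciPy is an assumed dependency

    c, A_ub, b_ub, bounds = build_lp(F, y, r, d)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return list(res.x[: len(F[0])])
```

For any fixed $\lambda$, setting $\xi_i = \phi(Y_i h_i)$ and $u_j = |\lambda_j|$ is feasible and attains the penalized empirical risk, which gives a quick way to check the encoding without running a solver.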

To illustrate the merits of $\hat\lambda$, we implement the method described above and first apply
it to a set of simulated examples. To fix ideas, we set $d = 0.25$ or, equivalently, $a = 3$.
For each run, 50 positive instances ($Y = +1$) and 50 negative instances ($Y = -1$) were
generated. Two hundred ($M = 200$) features ($f_j$'s) were simulated from a multivariate
normal distribution. For positive instances, the mean was set to $(1/\sqrt{2}, 1/\sqrt{2}, 0, \ldots, 0)'$,
whereas for the negative instances, the mean was set to $(-1/\sqrt{2}, -1/\sqrt{2}, 0, \ldots, 0)'$. In
both cases, the covariance matrix was the identity matrix. The operating characteristics
of the method are demonstrated in Figure 1. On the left-hand side, the misclassification
rate ($P(Y\mathsf{f}_{\hat\lambda}(X) < -0.5)$), rejection rate ($P(|Y\mathsf{f}_{\hat\lambda}(X)| \le 0.5)$) and associated $\ell$-risk of
the $\ell_1$-regularized generalized hinge loss ($R_\ell(\mathsf{f}_{\hat\lambda})$) are plotted as functions of the tuning
parameter $r$ for a typical simulation. The results are to be compared with the usual
$\ell_1$-regularized support vector machines where no rejection option is allowed. Since there
is no rejection, the misclassification rate for the usual support vector machines coincides
with its $\ell$-risk. It is evident that by incorporating the rejection option, $\hat\lambda$ yields a smaller
$\ell$-risk, provided that both methods are optimally tuned. To further investigate the merits
of allowing the rejection option, we repeated the experiment 200 times. The excess risk
$\Delta R_\ell$ of both the usual support vector machine and the proposed method are summarized
in the plot on the right-hand side. It further confirms the advantage of $\hat\lambda$.

To further demonstrate the merits of the method, we apply it to the mixture data
example considered in [4]. The training data consist of 200 data points generated from
a pair of two-dimensional mixture densities. Similarly to [4], we consider a dictionary
of Gaussian radial basis functions $f_j(\cdot) = \exp(-2\|\cdot - b_j\|^2)$, $j = 1, \ldots, 100$, where the
locations $b_j$ are placed on a $10 \times 10$ equally spaced lattice. To fix ideas, we consider the
case where $d = 0.25$. The optimal classification rule will classify an observation as $+1$
if the corresponding conditional probability $P(Y = +1 \mid X)$ is greater than 0.75 and as
$-1$ if the conditional probability is less than 0.25. When the conditional probability is
between 0.25 and 0.75, we withhold the decision. The corresponding decision boundaries
[Figure 1 graphic. Left-panel legend: Misclassification Rate (no rejection); Misclassification Rate (with rejection); Rejection Rate (with rejection); Risk (with rejection). Horizontal axis ticks: 0.01, 0.02, 0.05, 0.10, 0.20, 0.50. Right-panel box plots: No Rejection; with Rejection.]


Figure 1. Simulation - the effect of rejection, misclassification rate and excess risk $R_\ell$. The
left-hand panel shows the three criteria as functions of the tuning parameter $r$ for the support
vector machine (SVM) with rejection option for a typical run. Also included is the
misclassification rate for the usual SVM. It is evident that SVM with rejection option enjoys lower
misclassification rate by withholding decision for "hard-to-classify" cases. The right-hand panel
compares the excess $\ell$-risk for SVM with or without rejection option. The box plots of the excess
risk are produced based on 200 runs. This again confirms that SVM with rejection option leads
to improved performance in terms of the $\ell$ loss.



are given in the right-hand panel of Figure 2. It is known that the usual SVM only targets
the decision boundary identified with $P(Y = +1 \mid X) = 1/2$ and cannot be used to recover
the optimal decision boundaries given here; see, for instance, [12] for further discussion
of this issue. In contrast, the SVM with rejection option is devised specifically for this
purpose. To this end, we ran the SVM with rejection option with $a = 3$ and $\tau = 0.5$, as
discussed earlier. The tuning parameter $r$ was selected by tenfold cross-validation. The
left-hand panel of Figure 2 gives the estimated decision boundaries. It is clear from the
plot that SVM with rejection option successfully captured the main characteristics of the
underlying probabilities. The main difference between the two sets of decision boundaries
occurs in regions where no observations are available. As a result, the SVM with rejection
option opted for withholding a decision.
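
The dictionary used in this example is simple to construct. The following sketch builds the $10 \times 10$ lattice of centers and the radial basis features $f_j(x) = \exp(-2\|x - b_j\|^2)$, then applies the thresholded rule with $\tau = 0.5$. The bounding box of the lattice is not stated in the text and is left as a parameter here; the function names and the use of 0 for a withheld decision are assumptions.

```python
import math


def lattice_centers(x_min, x_max, y_min, y_max, k=10):
    """Centers b_j on an equally spaced k-by-k lattice over an assumed bounding box."""
    xs = [x_min + (x_max - x_min) * i / (k - 1) for i in range(k)]
    ys = [y_min + (y_max - y_min) * i / (k - 1) for i in range(k)]
    return [(bx, by) for bx in xs for by in ys]


def rbf_features(x, centers):
    """Gaussian radial basis features exp(-2 * ||x - b_j||^2), one per center."""
    return [
        math.exp(-2.0 * ((x[0] - bx) ** 2 + (x[1] - by) ** 2))
        for bx, by in centers
    ]


def classify(x, lam, centers, tau=0.5):
    """Return +1 or -1 when the score exceeds tau in absolute value, else 0 (withhold)."""
    score = sum(l * f for l, f in zip(lam, rbf_features(x, centers)))
    if abs(score) <= tau:
        return 0
    return 1 if score > 0 else -1
```

Because every basis function decays quickly away from its center, the score is close to zero far from the lattice points with nonzero coefficients, so the rule withholds a decision there.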
Figure 2. Mixture data - optimal and estimated decision boundaries. The left-hand panel
gives the optimal decision boundary, whereas the right-hand panel corresponds to the SVM 
with rejection option. In both plots, positive cases are represented by red circles and negative 
cases by green triangles. The light red regions correspond to classification Y = +1 and light 
green regions to classification Y = — 1. Areas where a decision is withheld are not shaded. The 
solid black line in the left-hand panel is the level set for $P(Y = +1 \mid X) = 0.5$. The solid black line
in the right-hand panel is the level set for $\mathsf{f}_{\hat\lambda} = 0$.



Appendix A: Connection between excess risk and weighted $L_2$ norm

The next lemma is a technical result that links the excess risk $\Delta R_\phi(\lambda)$ to the $L_2$ norm:

$$\|\mathsf{f}_\lambda - f_0\| = \sqrt{E[|\mathsf{f}_\lambda(X) - f_0(X)|^2 \omega(X)]}$$

with $\omega(X) = \eta(X)(1 - \eta(X))$. Its proof is rather technical and relies on results obtained
in [1]. Essentially, $\|\mathsf{f}_\lambda - f_0\|_\infty$ replaces the suboptimal bound $1 + C_a C_F$ in [11].
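Lemma A.1. Let $\alpha \ge 0$ be as in Definition 2.2. Then, for all $\lambda \in \mathbb{R}^M$,

$$\|\mathsf{f}_\lambda - f_0\|^{2+2\alpha} \le 4A(2d)^\alpha \|\mathsf{f}_\lambda - f_0\|_\infty^{2+\alpha} \{\Delta R_\phi(\lambda)\}^\alpha. \tag{A.1}$$

Proof. Let $f : \mathcal{X} \to \mathbb{R}$ be arbitrary and set

$$\rho_\eta(f, f_0) = \begin{cases} \eta|f - f_0|, & \text{if } \eta < d \text{ and } f < -1, \\ (1 - \eta)|f - f_0|, & \text{if } \eta > 1 - d \text{ and } f > 1, \\ |f - f_0|, & \text{otherwise;} \end{cases}$$

then [1], Lemma 9, states that

$$\Delta R_\phi(\lambda) \ge d^{-1} E[\rho_\eta(f, f_0)(X)(|\eta(X) - (1 - d)| I_{\{X \in E_-\}} + |\eta(X) - d| I_{\{X \in E_+\}})]$$

with

$$E_- = \{|\eta - (1 - d)| \le |\eta - d|\}, \quad E_+ = \{|\eta - (1 - d)| > |\eta - d|\}.$$

Using (A.1), for any set $E$,

$$E[\rho_\eta(f, f_0)(X)|\eta(X) - (1 - d)| I_{\{X \in E\}}] \ge tE[\rho_\eta(f, f_0)(X) I_{\{|\eta(X) - (1-d)| \ge t\}} I_{\{X \in E\}}]$$
$$= tE[\rho_\eta(f, f_0)(X) I_{\{X \in E\}}] - tE[\rho_\eta(f, f_0)(X) I_{\{|\eta(X) - (1-d)| < t,\ X \in E\}}]$$
$$\ge tE[\rho_\eta(f, f_0)(X) I_{\{X \in E\}} - \|f - f_0\|_\infty At^\alpha].$$

Similarly,

$$E[\rho_\eta(f, f_0)(X)|\eta(X) - d| I_{\{X \in E\}}] \ge tE[\rho_\eta(f, f_0)(X) I_{\{X \in E\}} - \|f - f_0\|_\infty At^\alpha],$$

and we obtain

$$\Delta R_\phi(\lambda) \ge d^{-1} tE[\rho_\eta(\mathsf{f}_\lambda, f_0)(X) I_{\{X \in E_+ \cup E_-\}} - 2\|\mathsf{f}_\lambda - f_0\|_\infty At^\alpha]$$
$$= d^{-1} tE[\rho_\eta(\mathsf{f}_\lambda, f_0)(X) - 2\|\mathsf{f}_\lambda - f_0\|_\infty At^\alpha].$$

Plugging

$$t = \left( \frac{E[\rho_\eta(\mathsf{f}_\lambda, f_0)(X)]}{4A\|\mathsf{f}_\lambda - f_0\|_\infty} \right)^{1/\alpha}$$

into the preceding expression, we obtain

$$\Delta R_\phi(\lambda) \ge \frac{(E[\rho_\eta(\mathsf{f}_\lambda, f_0)(X)])^{(1+\alpha)/\alpha}}{2d(4A\|\mathsf{f}_\lambda - f_0\|_\infty)^{1/\alpha}}.$$

Since

$$\|\mathsf{f}_\lambda - f_0\|^2 = E[\omega(X)(\mathsf{f}_\lambda - f_0)^2(X)] \le \|\mathsf{f}_\lambda - f_0\|_\infty E[\omega(X)|\mathsf{f}_\lambda(X) - f_0(X)|],$$

we get, for all $\lambda$,

$$\Delta R_\phi(\lambda) \ge \frac{(E[\omega(X)|\mathsf{f}_\lambda(X) - f_0(X)|])^{(1+\alpha)/\alpha}}{2d(4A\|\mathsf{f}_\lambda - f_0\|_\infty)^{1/\alpha}} \ge \frac{(\|\mathsf{f}_\lambda - f_0\|^2)^{(1+\alpha)/\alpha}}{2d(4A)^{1/\alpha}\|\mathsf{f}_\lambda - f_0\|_\infty^{(2+\alpha)/\alpha}}.$$

The claim follows. □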

Remark A.2. If $|\mathsf{f}_\lambda| \le 1$, then $\rho_\eta(\mathsf{f}_\lambda, f_0) = |\mathsf{f}_\lambda - f_0|$. Hence, if we restrict the parameters
$\lambda$ such that $\mathsf{f}_\lambda$ are bounded by 1, then we can impose the restricted eigenvalue condition
on the matrix with entries $E[f_i(X)f_j(X)]$ instead of $E[f_i(X)f_j(X)\omega(X)]$.



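Appendix B: A maximal inequality for a weighted empirical process

Recall that $\Lambda = \{\lambda \in \mathbb{R}^M : \|\lambda\|_{\ell_1} \le 1/(2r)\}$ and let $\theta \in \Lambda$ and $\varepsilon > 0$. Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a
convex function with Lipschitz constant $C_\varphi$ and define the risks

$$R_\varphi(\mathsf{f}_\lambda) = E[\varphi(Y\mathsf{f}_\lambda(X))], \qquad \hat R_\varphi(\mathsf{f}_\lambda) = \frac{1}{n}\sum_{i=1}^n \varphi(Y_i\mathsf{f}_\lambda(X_i)).$$

Finally, let $\varepsilon > 0$ and set

$$\hat r(\varphi, \theta, \varepsilon) = \sup_{\lambda \in \Lambda} \frac{|\{\hat R_\varphi(\mathsf{f}_\lambda) - R_\varphi(\mathsf{f}_\lambda)\} - \{\hat R_\varphi(\mathsf{f}_\theta) - R_\varphi(\mathsf{f}_\theta)\}|}{\|\theta - \lambda\|_{\ell_1} + \varepsilon}.$$

We prove a maximal inequality for $\hat r(\varphi, \theta, \varepsilon)$ which slightly generalizes the result obtained
in [11].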

Proposition B.1. Let $0 < \delta < 1$ and set

$$r(\varphi, \theta, \varepsilon) = E[\hat r(\varphi, \theta, \varepsilon)] + C_\varphi C_F \sqrt{\frac{2\log(1/\delta)}{n}}.$$

Then,

$$P\{r(\varphi, \theta, \varepsilon) \ge \hat r(\varphi, \theta, \varepsilon)\} \ge 1 - \delta.$$

Proof. First, observe that changing a pair $(X_i, Y_i)$ in $\hat r$ changes it by at most $2C_\varphi C_F/n$.
The result follows immediately after applying McDiarmid's exponential inequality [3],
Theorem 2.2, page 8. □
We now control the expectation of $\hat r(\varphi, \theta, \varepsilon)$.

Proposition B.2. Set $J = \lceil \log_2(1/\{\varepsilon r\}) \rceil$. Then,

$$E[\hat r(\varphi, \theta, \varepsilon)] \le 9C_\varphi C_F \sqrt{\frac{2\log\{2(M \vee n)\}}{n}} + \frac{2JC_\varphi C_F}{\sqrt{2(M \vee n)}}.$$



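Proof. Let $\sigma_1, \ldots, \sigma_n$ be independent Rademacher variables, taking the values $\pm 1$, each
with probability 1/2, independent of the data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Set

$$R^\circ_\varphi(\mathsf{f}_\lambda) = \frac{1}{n}\sum_{i=1}^n \sigma_i \varphi(Y_i\mathsf{f}_\lambda(X_i)).$$

A standard symmetrization trick [3], page 18, shows that

$$E[\hat r(\varphi, \theta, \varepsilon)] \le 2E\left[ \sup_{\lambda \in \Lambda} \frac{|R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)|}{\|\lambda - \theta\|_{\ell_1} + \varepsilon} \right]$$
$$\le 2E\left[ \sup_{\|\lambda - \theta\|_{\ell_1} \le \varepsilon} \frac{|R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)|}{\|\lambda - \theta\|_{\ell_1} + \varepsilon} \right] + 2E\left[ \sup_{\varepsilon < \|\lambda - \theta\|_{\ell_1} \le 1/r} \frac{|R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)|}{\|\lambda - \theta\|_{\ell_1} + \varepsilon} \right]$$
$$= (I) + (II).$$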

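The first term

$$(I) = 2E\left[ \sup_{\|\lambda - \theta\|_{\ell_1} \le \varepsilon} \frac{|R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)|}{\|\lambda - \theta\|_{\ell_1} + \varepsilon} \right]$$

can be bounded using the contraction principle for Rademacher processes; see [7], pages
112-113. For this, we observe that the function $g(z) = \varphi(z_0 + z) - \varphi(z_0)$ is Lipschitz with
Lipschitz constant $C_\varphi$ and $g(0) = 0$. Consequently,

$$(I) \le \frac{2}{\varepsilon} E\left[ \sup_{\|\lambda - \theta\|_{\ell_1} \le \varepsilon} |R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)| \right]$$
$$\le \frac{2C_\varphi}{\varepsilon} E\left[ \sup_{\|\lambda - \theta\|_{\ell_1} \le \varepsilon} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i Y_i \mathsf{f}_{\lambda - \theta}(X_i) \right| \right]$$
$$\le \frac{2C_\varphi}{\varepsilon} E\left[ \sup_{\|\lambda - \theta\|_{\ell_1} \le \varepsilon} \|\lambda - \theta\|_{\ell_1} \max_{1 \le j \le M} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i Y_i f_j(X_i) \right| \right]$$
$$\le 2C_\varphi E\left[ \max_{1 \le j \le M} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i Y_i f_j(X_i) \right| \right]$$
$$\le 2C_\varphi C_F \sqrt{\frac{2\log(2M)}{n}}.$$

The last maximal inequality can be found in [3], Lemma 2.2, page 7, which uses the fact
that the variables $\sigma_i Y_i f_j(X_i)$ are sub-Gaussian,

$$E\left[ \exp\left\{ s\sum_{i=1}^n \sigma_i Y_i f_j(X_i) \right\} \right] \le \exp(ns^2 C_F^2/2)$$

for all $s$, which follows, in turn, from [3], Lemma 2.1, page 5.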

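The second term (II) requires a peeling argument [10], page 70. Since $0 \le \hat r \le 2C_\varphi C_F$
almost surely, we can use the bound

$$(II) \le \zeta + 2C_\varphi C_F\, P\left\{ \sup_{\varepsilon < \|\lambda - \theta\|_{\ell_1} \le 1/r} 2\,\frac{|R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)|}{\|\lambda - \theta\|_{\ell_1} + \varepsilon} \ge \zeta \right\}.$$

Observe that for any $\zeta > 0$,

$$P\left\{ \sup_{\varepsilon < \|\lambda - \theta\|_{\ell_1} \le 1/r} 2\,\frac{|R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)|}{\|\lambda - \theta\|_{\ell_1} + \varepsilon} \ge \zeta \right\} \le \sum_{j=1}^J P\left\{ \sup_{2^{j-1}\varepsilon < \|\lambda - \theta\|_{\ell_1} \le 2^j\varepsilon} |R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)| \ge 2^{j-2}\varepsilon\zeta \right\}.$$

Now, set

$$Z_j = \sup_{\|\lambda - \theta\|_{\ell_1} \le 2^j\varepsilon} |R^\circ_\varphi(\mathsf{f}_\lambda) - R^\circ_\varphi(\mathsf{f}_\theta)|$$

and the same considerations leading to the final bound of (I) above yield

$$E[Z_j] \le 2^j \varepsilon C_\varphi C_F \sqrt{\frac{2\log(2M)}{n}} \tag{B.1}$$

and for $t = 1/\sqrt{2}$, we obtain

$$(II) \le \zeta + 2C_\varphi C_F \sum_{j=1}^J P\{Z_j - E[Z_j] \ge 2^{j-2}\varepsilon\zeta - E[Z_j]\}.$$

A change of a single pair changes $Z_j$ by at most $2C_\varphi C_F(2^j\varepsilon)/n$, so that another
application of the bounded differences inequality [3], Theorem 2.2, page 8, gives, by
taking

$$\zeta = 7C_\varphi C_F \sqrt{\frac{2\log\{2(M \vee n)\}}{n}},$$

the final bound

$$\sum_{j=1}^J P\{Z_j - E[Z_j] \ge 2^{j-2}\varepsilon\zeta - E[Z_j]\} \le \sum_{j=1}^J P\left\{ Z_j - E[Z_j] \ge t \cdot 2^j\varepsilon C_\varphi C_F \sqrt{\frac{2\log(2M \vee 2n)}{n}} \right\}$$
$$\le \sum_{j=1}^J \exp\left\{ -\frac{2t^2 (C_\varphi C_F 2^j\varepsilon)^2\, 2\log(2M \vee 2n)}{(2C_\varphi C_F 2^j\varepsilon)^2} \right\}$$
$$= J(2M \vee 2n)^{-t^2} \le J/\sqrt{2M \vee 2n}.$$

Finally, we invoke (B.1) to complete the proof of Proposition B.2. □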